Abstract

Relation classification is an important research arena in the field of natural language processing (NLP). In this paper, we present SDP-LSTM, a novel neural network to classify the relation of two entities in a sentence. Our neural architecture leverages the shortest dependency path (SDP) between two entities; multichannel recurrent neural networks, with long short term memory (LSTM) units, pick up heterogeneous information along the SDP. Our proposed model has several distinct features: (1) The shortest dependency paths retain most relevant information (to relation classification), while eliminating irrelevant words in the sentence. (2) The multichannel LSTM networks allow effective information integration from heterogeneous sources over the dependency paths. (3) A customized dropout strategy regularizes the neural network to alleviate overfitting. We test our model on the SemEval 2010 relation classification task, and achieve an $F_1$-score of 83.7%, higher than competing methods in the literature.

1 Introduction 


Relation classification is an important NLP task. It plays a key role in various scenarios, e.g., information extraction (Wu and Weld, 2010), question answering (Yao and Van Durme, 2014), medical informatics (Wang and Fan, 2014), ontology learning (Xu et al., 2014), etc. The aim of relation classification is to categorize into predefined classes the relations between pairs of marked entities in given texts. For instance, in the sentence “A trillion gallons of [water]$_{e_1}$ have been poured into an empty [region]$_{e_2}$ of outer space,” the entities water and region are of relation Entity-Destination($e_1$, $e_2$).

* Corresponding authors.

Traditional relation classification approaches rely largely on feature representation (Kambhatla, 2004), or kernel design (Zelenko et al., 2003; Bunescu and Mooney, 2005). The former method usually incorporates a large set of features; it is difficult to improve the model performance if the feature set is not very well chosen. The latter approach, on the other hand, depends largely on the designed kernel, which summarizes all data information. Deep neural networks, emerging recently, provide a way of highly automatic feature learning (Bengio et al., 2013), and have exhibited considerable potential (Zeng et al., 2014; Santos et al., 2015). However, human engineering, that is, incorporating human knowledge into the network’s architecture, is still important and beneficial.

This paper proposes a new neural network, SDP-LSTM, for relation classification. Our model utilizes the shortest dependency path (SDP) between two entities in a sentence; we also design a long short term memory (LSTM)-based recurrent neural network for information processing. The neural architecture is mainly inspired by the following observations.

• Shortest dependency paths are informative (Fundel et al., 2007; Chen et al., 2014). To determine the two entities’ relation, we find it mostly sufficient to use only the words along the SDP: they concentrate on the most relevant information while diminishing less relevant noise. Figure 1 depicts the dependency parse tree of the aforementioned sentence. Words along the SDP form a trimmed phrase (gallons of water poured into region) of the original sentence, which conveys much information about the target relation. Other words, such as a, trillion, outer space, are less informative and may bring noise if not dealt with properly.

• Direction matters. Dependency trees are a kind of directed graph. The dependency relation between into and region is PREP; such a relation hardly makes any sense if the directed edge is reversed. Moreover, the entities’ relation distinguishes its directionality, that is, r(a, b) differs from r(b, a), for the same given relation r and two entities a, b. Therefore, we think it necessary to let the neural model process information in a direction-sensitive manner. Out of this consideration, we separate an SDP into two sub-paths, each from an entity to the common ancestor node. The extracted features along the two sub-paths are concatenated to make the final classification.

• Linguistic information helps. For example, with prior knowledge of hyponymy, we know “water is a kind of substance.” This is a hint that the entities, water and region, are more of Entity-Destination relation than, say, Communication-Topic. To gather heterogeneous information along the SDP, we design a multichannel recurrent neural network. It makes use of information from various sources, including words themselves, POS tags, WordNet hypernyms, and the grammatical relations between governing words and their children.

For effective information propagation and integration, our model leverages LSTM units during recurrent propagation. We also customize a new dropout strategy for our SDP-LSTM network to alleviate the problem of overfitting. To the best of our knowledge, we are the first to use LSTM-based recurrent neural networks for the relation classification task.

We evaluate our proposed method on the SemEval 2010 relation classification task, and achieve an $F_1$-score of 83.7%, higher than competing methods in the literature.

In the rest of this paper, we review related work in Section 2. In Section 3, we describe our SDP-LSTM model in detail. Section 4 presents quantitative experimental results. Finally, we have our conclusion in Section 5.

2 Related Work 

Relation classification is a widely studied task in the NLP community. Various existing methods mainly fall into three classes: feature-based, kernel-based, and neural network-based.

Figure 1: The dependency parse tree corresponding to the sentence “A trillion gallons of water have been poured into an empty region of outer space.” Red lines indicate the shortest dependency path between entities water and region. An edge a → b refers to a being governed by b. Dependency types are labeled by the parser, but not presented in the figure for clarity.

In feature-based approaches, different sets of features are extracted and fed to a chosen classifier (e.g., logistic regression). Three types of features are commonly used. Lexical features concentrate on the entities of interest, e.g., entities per se, entity POS, entity neighboring information. Syntactic features include chunking, parse trees, etc. Semantic features are exemplified by the concept hierarchy, entity class, entity mention. Kambhatla (2004) uses a maximum entropy model to combine these features for relation classification. However, different sets of handcrafted features are largely complementary to each other (e.g., hypernyms versus named-entity tags), and thus it is hard to improve performance in this way (Zhou et al., 2005).


Kernel-based approaches specify some measure of similarity between two data samples, without explicit feature representation. Zelenko et al. (2003) compute the similarity of two trees by utilizing their common subtrees. Bunescu and Mooney (2005) propose a shortest path dependency kernel for relation classification. Its main idea is that the relation strongly relies on the dependency path between two given entities. Wang (2008) provides a systematic analysis of several kernels and shows that relation extraction can benefit from combining convolution kernel and syntactic features. Plank and Moschitti (2013) introduce semantic information into kernel methods in addition to considering structural information only. One potential difficulty of kernel methods is that all data information is completely summarized by the kernel function (similarity measure), and thus designing an effective kernel becomes crucial.

Deep neural networks, emerging recently, can learn underlying features automatically, and have attracted growing interest in the literature. Socher et al. (2011) propose a recursive neural network (RNN) along sentences’ parse trees for sentiment analysis; such a model can also be used to classify relations (Socher et al., 2012). Hashimoto et al. (2013) explicitly weight phrases’ importance in RNNs to improve performance. Ebrahimi and Dou (2015) rebuild an RNN on the dependency path between two marked entities. Zeng et al. (2014) explore convolutional neural networks, by which they utilize sequential information of sentences. Santos et al. (2015) also use the convolutional network; besides, they propose a ranking loss function with data cleaning, and achieve the state-of-the-art result in SemEval-2010 Task 8.

In addition to the above studies, which mainly focus on relation classification approaches and models, other related research trends include information extraction from Web documents in a semi-supervised manner (Bunescu and Mooney, 2007; Banko et al., 2007), dealing with small datasets without enough labels by distant supervision techniques (Mintz et al., 2009), etc.


3 The Proposed SDP-LSTM Model 

In this section, we describe our SDP-LSTM model in detail. Subsection 3.1 delineates the overall architecture of our model. Subsection 3.2 presents the rationale of using SDPs. Four different information channels along the SDP are explained in Subsection 3.3. Subsection 3.4 introduces the recurrent neural network with long short term memory, which is built upon the dependency path. Subsection 3.5 customizes a dropout strategy for our network to alleviate overfitting. We finally present our training objective in Subsection 3.6.


3.1 Overview 

Figure 2 depicts the overall architecture of our SDP-LSTM network.

First, a sentence is parsed to a dependency tree by the Stanford parser;¹ the shortest dependency path (SDP) is extracted as the input of our network. Along the SDP, four different types of information, referred to as channels, are used, including the words, POS tags, grammatical relations, and WordNet hypernyms. (See Figure 2a.) In each channel, discrete inputs, e.g., words, are mapped to real-valued vectors, called embeddings, which capture the underlying meanings of the inputs.

Two recurrent neural networks (Figure 2b) pick up information along the left and right sub-paths of the SDP, respectively. (The path is separated by the common ancestor node of two entities.) Long short term memory (LSTM) units are used in the recurrent networks for effective information propagation. A max pooling layer thereafter gathers information from LSTM nodes in each path.

The pooling layers from different channels are concatenated, and then connected to a hidden layer. Finally, we have a softmax output layer for classification. (See again Figure 2a.)
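The pooling and concatenation steps described above can be illustrated with a minimal sketch. It assumes that the LSTM hidden states of each sub-path are already available as plain Python lists; the function names `max_pool` and `sentence_vector`, the channel names, and the toy numbers are illustrative assumptions, not part of the original implementation.

```python
def max_pool(states):
    """Element-wise maximum over a list of equal-length hidden-state vectors."""
    return [max(column) for column in zip(*states)]


def sentence_vector(channels):
    """Concatenate the pooled left and right sub-path vectors of every channel.

    `channels` maps a channel name to a (left_states, right_states) pair,
    where each element is a list of hidden-state vectors, one per word.
    """
    pooled = []
    for left_states, right_states in channels.values():
        pooled += max_pool(left_states) + max_pool(right_states)
    return pooled


# Toy hidden states (hypothetical values): 2-dimensional for the word channel,
# 1-dimensional for the POS channel.
channels = {
    "word": ([[0.1, -0.3], [0.4, 0.2]], [[-0.5, 0.9]]),
    "pos": ([[0.0], [0.7]], [[0.3], [-0.1]]),
}
features = sentence_vector(channels)
print(features)  # [0.4, 0.2, -0.5, 0.9, 0.7, 0.3]
```

The resulting fixed-length vector is what would be fed to the hidden layer and the softmax output layer; because pooling takes a maximum per dimension, its length does not depend on the number of words on either sub-path.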


3.2 The Shortest Dependency Path 

The dependency parse tree is naturally suitable for relation classification because it focuses on the action and agents in a sentence (Socher et al., 2014). Moreover, the shortest path between entities, as discussed in Section 1, condenses most illuminating information for entities’ relation.

We also observe that the sub-paths, separated by the common ancestor node of two entities, provide strong hints for the relation’s directionality. Take Figure 1 as an example. Two entities water and region have their common ancestor node, poured, which separates the SDP into two parts:

[water]$_{e_1}$ → of → gallons → poured

and

poured ← into ← [region]$_{e_2}$

The first sub-path captures information of $e_1$, whereas the second sub-path is mainly about $e_2$. By examining the two sub-paths separately, we know $e_1$ and $e_2$ are of relation Entity-Destination($e_1$, $e_2$), rather than Entity-Destination($e_2$, $e_1$).
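As a rough illustration of how the two sub-paths can be obtained, the sketch below stores a dependency tree as a child-to-governor map and walks from each entity up to the lowest common ancestor. The map `HEADS` is a hand-written, partial encoding of the example sentence (head assignments for words off the path are assumptions), and the function names are hypothetical; a real pipeline would read the tree from a parser's output.

```python
# Partial dependency tree of the example sentence, stored as child -> governor.
# Entries for words off the shortest path are illustrative assumptions.
HEADS = {
    "trillion": "gallons",
    "gallons": "poured",
    "of": "gallons",
    "water": "of",
    "have": "poured",
    "been": "poured",
    "poured": None,  # root of the tree
    "into": "poured",
    "empty": "region",
    "region": "into",
}


def path_to_root(word, heads):
    """Follow governor links from `word` up to the root."""
    path = [word]
    while heads[path[-1]] is not None:
        path.append(heads[path[-1]])
    return path


def sdp_subpaths(e1, e2, heads):
    """Split the shortest dependency path at the lowest common ancestor."""
    up1 = path_to_root(e1, heads)
    up2 = path_to_root(e2, heads)
    on_up2 = set(up2)
    # The first node on e1's upward path that also lies on e2's upward path.
    ancestor = next(w for w in up1 if w in on_up2)
    left = up1[: up1.index(ancestor) + 1]
    right = up2[: up2.index(ancestor) + 1]
    return left, right


left, right = sdp_subpaths("water", "region", HEADS)
print(left)   # ['water', 'of', 'gallons', 'poured']
print(right)  # ['region', 'into', 'poured']
```

Both lists run bottom-up from an entity to the common ancestor, which matches the direction in which the two sub-paths are read in the text.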

Following the above intuition, we design two recurrent neural networks, which propagate bottom-up from the entities to their common ancestor. In this way, our model is direction-sensitive.

¹ http://nlp.stanford.edu/software/lex-parser.shtml

Figure 2: (a) The overall architecture of SDP-LSTM. (b) One channel of the recurrent neural networks built upon the shortest dependency path. The channels are words, part-of-speech (POS) tags, grammatical relations (abbreviated as GR in the figure), and WordNet hypernyms.


3.3 Channels 

We make use of four types of information along the SDP for relation classification. We call them channels as these information sources do not interact during recurrent propagation. Detailed channel descriptions are as follows.


• Word representations. Each word in a given sentence is mapped to a real-valued vector by looking it up in a word embedding table. Trained without supervision on a large corpus, word embeddings are thought to capture words’ syntactic and semantic information well (Mikolov et al., 2013b).


• Part-of-speech tags. Since word embeddings are obtained on a generic corpus of a large scale, the information they contain may not agree with a specific sentence. We deal with this problem by allying each input word with its POS tag, e.g., noun, verb, etc. In our experiment, we use only a coarse-grained POS category, containing 15 different tags.


• Grammatical relations. The dependency relations between a governing word and its children make a difference in meaning. The same word pair may have different dependency relation types. For example, “beats $\xrightarrow{nsubj}$ it” is distinct from “beats $\xrightarrow{dobj}$ it.” Thus, it is necessary to capture such grammatical relations in SDPs. In our experiment, grammatical relations are grouped into 19 classes, mainly based on a coarse-grained classification (De Marneffe et al., 2006).


• WordNet hypernyms. As illustrated in Section 1, hyponymy information is also useful for relation classification. (Details are not repeated here.) To leverage WordNet hypernyms, we use a tool developed by Ciaramita and Altun (2006).² The tool assigns a hypernym to each word, from 41 predefined concepts in WordNet, e.g., noun.food, verb.motion, etc. Given its hypernym, each word gains a more abstract concept, which helps to build a linkage between different but conceptually similar words.


As we can see, POS tags, grammatical relations, and WordNet hypernyms are also discrete (like words per se). However, no prevailing embedding learning method exists for POS tags, say. Hence, we randomly initialize their embeddings, and tune them in a supervised fashion during training. We notice that these information sources contain far fewer symbols (15, 19, and 41, respectively) than the vocabulary size (greater than 25,000). Hence, we believe our strategy of random initialization is feasible, because these embeddings can be adequately tuned during supervised training.
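To make the size argument concrete, here is a minimal sketch of randomly initialized embedding tables for the three small channels. The symbol counts (15, 19, 41) come from the text and the 50-dimensional size from Section 4.2; the uniform initialization range, the seed, and all identifiers are assumptions made only for illustration.

```python
import random

# Number of distinct symbols per channel, as stated in the text.
CHANNEL_SIZES = {"pos": 15, "gr": 19, "wordnet": 41}
EMBED_DIM = 50  # dimensionality reported in Section 4.2


def init_table(num_symbols, dim, rng, scale=0.1):
    """One random row per symbol; the range [-scale, scale] is an assumption."""
    return [[rng.uniform(-scale, scale) for _ in range(dim)]
            for _ in range(num_symbols)]


rng = random.Random(0)
tables = {name: init_table(n, EMBED_DIM, rng)
          for name, n in CHANNEL_SIZES.items()}

# Total number of randomly initialized embedding parameters.
num_params = sum(n * EMBED_DIM for n in CHANNEL_SIZES.values())
print(num_params)  # 3750
```

For comparison, a word table with more than 25,000 entries of 200 dimensions holds over 5,000,000 parameters, which is consistent with pretraining the word channel while tuning the small tables from scratch.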

² http://sourceforge.net/projects/supersensetag

3.4 Recurrent Neural Network with Long Short Term Memory Units

The recurrent neural network is suitable for modeling sequential data by nature, as it keeps a hidden state vector $h$, which changes with input data at each step accordingly. We use the recurrent network to gather information along each sub-path in the SDP (Figure 2b).

The hidden state $h_t$, for the $t$-th word in the sub-path, is a function of its previous state $h_{t-1}$ and the current word $x_t$. Traditional recurrent networks have a basic interaction, that is, the input is linearly transformed by a weight matrix and non-linearly squashed by an activation function. Formally, we have

$$h_t = f_h(W_{in} x_t + W_{rec} h_{t-1} + b_h)$$

where $W_{in}$ and $W_{rec}$ are weight matrices for the input and recurrent connections, respectively, $b_h$ is a bias term for the hidden state vector, and $f_h$ a non-linear activation function (e.g., tanh).

One problem of the above model is known as gradient vanishing or exploding. The training of neural networks requires gradient back-propagation. If the propagation sequence (path) is too long, the gradient may either grow or decay exponentially, depending on the magnitude of $W_{rec}$. This makes training difficult.

Long short term memory (LSTM) units are proposed in Hochreiter (1998) to overcome this problem. The main idea is to introduce an adaptive gating mechanism, which decides the degree to which LSTM units keep the previous state and memorize the extracted features of the current data input. Many LSTM variants have been proposed in the literature. We adopt in our method a variant introduced by Zaremba and Sutskever (2014), also used in Zhu et al. (2014).

Concretely, the LSTM-based recurrent neural network comprises four components: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and a memory cell $c_t$ (depicted in Figure 3 and formalized through Equations 1–6 below).

Figure 3: A long short term memory unit. h: hidden unit. c: memory cell. i: input gate. f: forget gate. o: output gate. g: candidate cell. ⊗: element-wise multiplication. The remaining symbol in the figure marks an activation function.

The three adaptive gates $i_t$, $f_t$, and $o_t$ depend on the previous state $h_{t-1}$ and the current input $x_t$ (Equations 1–3). An extracted feature vector $g_t$ is also computed, by Equation 4, serving as the candidate memory cell.

$$i_t = \sigma(W_i \cdot x_t + U_i \cdot h_{t-1} + b_i) \tag{1}$$

$$f_t = \sigma(W_f \cdot x_t + U_f \cdot h_{t-1} + b_f) \tag{2}$$

$$o_t = \sigma(W_o \cdot x_t + U_o \cdot h_{t-1} + b_o) \tag{3}$$

$$g_t = \tanh(W_g \cdot x_t + U_g \cdot h_{t-1} + b_g) \tag{4}$$

The current memory cell $c_t$ is a combination of the candidate content $g_t$ and the previous cell content $c_{t-1}$, weighted by the input gate $i_t$ and the forget gate $f_t$, respectively. (See Equation 5 below.)

$$c_t = i_t \otimes g_t + f_t \otimes c_{t-1} \tag{5}$$

The output of LSTM units is the recurrent network’s hidden state, which is computed by Equation 6 as follows.

$$h_t = o_t \otimes \tanh(c_t) \tag{6}$$

In the above equations, $\sigma$ denotes a sigmoid function; $\otimes$ denotes element-wise multiplication.
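The six LSTM equations of this subsection translate almost line by line into code. The following is a minimal pure-Python sketch of a single step, with vectors as lists and matrices as lists of rows; the function names, the `params` layout, and the toy values are assumptions for illustration and do not reflect the authors' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-rows matrices W, U and vectors x, h, b."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; `params` maps 'i', 'f', 'o', 'g' to a (W, U, b) triple."""
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(z) for z in affine(W, x_t, U, h_prev, b)]

    i_t = gate("i", sigmoid)    # input gate, Eq. 1
    f_t = gate("f", sigmoid)    # forget gate, Eq. 2
    o_t = gate("o", sigmoid)    # output gate, Eq. 3
    g_t = gate("g", math.tanh)  # candidate memory cell, Eq. 4
    c_t = [i * g + f * c for i, g, f, c in zip(i_t, g_t, f_t, c_prev)]  # Eq. 5
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                  # Eq. 6
    return h_t, c_t


# Toy check with all-zero weights: every gate equals 0.5 and the candidate is 0,
# so the new cell is half of the previous cell.
zeros = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {name: zeros for name in "ifog"}
h_t, c_t = lstm_step([1.0, -1.0], [0.3], [2.0], params)
print(c_t)  # [1.0]
```

Running the step repeatedly along a sub-path, feeding each returned `h_t` and `c_t` into the next call, yields the sequence of hidden states that the pooling layer later summarizes.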

3.5 Dropout Strategies 

A good regularization approach is needed to alleviate overfitting. Dropout, proposed recently by Hinton et al. (2012), has been very successful on feed-forward networks. By randomly omitting feature detectors from the network during training, it can obtain less interdependent network units and achieve better performance. However, the conventional dropout does not work well with recurrent neural networks with LSTM units, since dropout may hurt the valuable memorization ability of memory units.

As there is no consensus on how to drop out LSTM units in the literature, we try several dropout strategies for our SDP-LSTM network:

• Dropout embeddings;

• Dropout inner cells in memory units, including $i_t$, $g_t$, $o_t$, $c_t$, and $h_t$; and

• Dropout the penultimate layer.
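The first strategy in the list, dropout on embeddings, can be sketched in a few lines. The version below only zeroes each dimension independently with a given rate during training; the function name and the seeded generator are illustrative assumptions, and any rescaling of activations at training or test time (common in practice) is deliberately left out because the text does not describe it.

```python
import random


def dropout(x, rate, rng):
    """Set each dimension of x to zero independently with probability `rate`."""
    return [0.0 if rng.random() < rate else value for value in x]


rng = random.Random(0)          # seeded only to make the sketch reproducible
x_t = [0.5, -1.2, 0.3, 0.8]     # a toy embedding vector (hypothetical values)

kept = dropout(x_t, 0.0, rng)     # rate 0 keeps every dimension
dropped = dropout(x_t, 1.0, rng)  # rate 1 zeroes every dimension
mixed = dropout(x_t, 0.5, rng)    # each dimension survives with probability 0.5
print(kept, dropped)
```

The other two strategies would apply the same operator to the LSTM inner vectors or to the penultimate hidden layer instead of to the embedding.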


As we shall see in Section 4.2, dropping out LSTM units turns out to be inimical to our model, whereas the other two strategies boost performance.

The following equations formalize the dropout operations on the embedding layers, where $D$ denotes the dropout operator. Each dimension in the embedding vector, $x_t$, is set to zero with a predefined dropout rate.

$$i_t = \sigma(W_i \cdot D(x_t) + U_i \cdot h_{t-1} + b_i) \tag{7}$$

$$f_t = \sigma(W_f \cdot D(x_t) + U_f \cdot h_{t-1} + b_f) \tag{8}$$

$$o_t = \sigma(W_o \cdot D(x_t) + U_o \cdot h_{t-1} + b_o) \tag{9}$$

$$g_t = \tanh(W_g \cdot D(x_t) + U_g \cdot h_{t-1} + b_g) \tag{10}$$

3.6 Training Objective

The SDP-LSTM described above propagates information along a sub-path from an entity to the common ancestor node (of the two entities). A max pooling layer packs, for each sub-path, the recurrent network’s states, $h$’s, to a fixed vector by taking the maximum value in each dimension.

Such architecture applies to all channels, namely, words, POS tags, grammatical relations, and WordNet hypernyms. The pooling vectors in these channels are concatenated, and fed to a fully connected hidden layer. Finally, we add a softmax output layer for classification. The training objective is the penalized cross-entropy error, given by

$$J = -\sum_{i=1}^{n_c} t_i \log y_i + \lambda \left( \sum_{i=1}^{\omega} \|W_i\|_F^2 + \sum_{i=1}^{\upsilon} \|U_i\|_F^2 \right)$$

where $t \in \mathbb{R}^{n_c}$ is the one-hot represented ground truth and $y \in \mathbb{R}^{n_c}$ is the estimated probability for each class by softmax. ($n_c$ is the number of target classes.) $\|\cdot\|_F$ denotes the Frobenius norm of a matrix; $\omega$ and $\upsilon$ are the numbers of weight matrices (for $W$’s and $U$’s, respectively). $\lambda$ is a hyperparameter that specifies the magnitude of penalty on weights. Note that we do not add $\ell_2$ penalty to bias parameters.

We pretrained word embeddings by word2vec (Mikolov et al., 2013a) on the English Wikipedia corpus; other parameters are initialized randomly. We apply stochastic gradient descent (with mini-batch size 10) for optimization; gradients are computed by standard back-propagation. Training details are further introduced in Section 4.2.

4 Experiments


In this section, we present our experiments in detail. Our implementation is built upon Mou et al. (2015). Section 4.1 introduces the dataset; Section 4.2 describes hyperparameter settings. In Section 4.3, we compare SDP-LSTM’s performance with other methods in the literature. We also analyze the effect of different channels in Section 4.4.


4.1 Dataset 


The SemEval-2010 Task 8 dataset is a widely used benchmark for relation classification (Hendrickx et al., 2010). The dataset contains 8,000 sentences for training, and 2,717 for testing. We split 1/10 of the samples out of the training set for validation.

The target contains 19 labels: 9 directed relations, each counted in both directions (18 labels), and an undirected Other class. The directed relations are listed below.


• Cause-Effect 

• Component-Whole 

• Content-Container 

• Entity-Destination 

• Entity-Origin 

• Message-Topic 

• Member-Collection 

• Instrument-Agency 

• Product-Producer 
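The count of 19 target labels follows from taking each of the nine directed relations in both argument orders and adding the single Other class. A small sketch makes this explicit; the string format of the labels is an assumption chosen for readability.

```python
RELATIONS = [
    "Cause-Effect", "Component-Whole", "Content-Container",
    "Entity-Destination", "Entity-Origin", "Message-Topic",
    "Member-Collection", "Instrument-Agency", "Product-Producer",
]

# Each directed relation contributes two labels, one per argument order.
LABELS = [
    f"{relation}({first},{second})"
    for relation in RELATIONS
    for first, second in (("e1", "e2"), ("e2", "e1"))
] + ["Other"]

print(len(LABELS))  # 19
```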

Two sample sentences with directed relations are illustrated below.

[People]$_{e_1}$ have been moving back into [downtown]$_{e_2}$.

Financial [stress]$_{e_1}$ is one of the main causes of [divorce]$_{e_2}$.

The target labels are Entity-Destination($e_1$, $e_2$) and Cause-Effect($e_1$, $e_2$), respectively.

The dataset also contains an undirected Other class. Hence, there are 19 target labels in total. The undirected Other class takes in entities that do not fit into the above categories, illustrated by the following example.

A misty [ridge]$_{e_1}$ uprises from the [surge]$_{e_2}$.

We use the official macro-averaged $F_1$-score to evaluate model performance. This official measurement excludes the Other relation. Nonetheless, we have no special treatment of the Other class in our experiments, which is typical in other studies.
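To illustrate what excluding Other from a macro-averaged $F_1$ means, the sketch below averages per-label $F_1$ over every label except Other. It is a simplified stand-in, not the official scorer: it treats each label string as its own class and ignores how the official script handles directionality. The function name and the toy predictions are assumptions.

```python
def macro_f1(gold, pred, exclude=("Other",)):
    """Macro-averaged F1 over all labels except those in `exclude`."""
    labels = sorted((set(gold) | set(pred)) - set(exclude))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0


gold = ["Cause-Effect", "Cause-Effect", "Entity-Origin", "Other"]
pred = ["Cause-Effect", "Other", "Entity-Origin", "Entity-Origin"]
score = macro_f1(gold, pred)
print(round(score, 4))  # 0.6667
```

Note that mistakes involving Other still affect the score: predicting Other for a true Cause-Effect instance lowers that relation's recall, and predicting a relation for a true Other instance lowers that relation's precision.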























Figure 4: $F_1$-scores versus dropout rates. We first evaluate the effect of dropout embeddings (a). Then the dropout of the inner cells (b) and the penultimate layer (c) is tested with word embeddings being dropped out by 0.5.


4.2 Hyperparameters and Training Details 


This subsection presents hyperparameter tuning for our model. We set word embeddings to be 200-dimensional; POS, WordNet hypernym, and grammatical relation embeddings are 50-dimensional. Each channel of the LSTM network contains the same number of units as its source embeddings (either 200 or 50). The penultimate hidden layer is 100-dimensional. As it is not feasible to perform full grid search for all hyperparameters, the above values are chosen empirically.

We add $\ell_2$ penalty for weights with coefficient $10^{-5}$, which was chosen by validation from the set $\{10^{-2}, 10^{-3}, \cdots, 10^{-7}\}$.
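The penalty coefficient enters the training objective of Section 3.6 as the weight on the squared Frobenius norms. A minimal sketch of that objective for a single sample is given below; the function name, the list-based matrix representation, and the toy numbers (including the value of `lam`) are assumptions for illustration only.

```python
import math


def penalized_cross_entropy(t, y, weight_matrices, lam):
    """Cross-entropy of one-hot t against probabilities y, plus lam times the
    sum of squared Frobenius norms of the given weight matrices."""
    cross_entropy = -sum(t_i * math.log(y_i) for t_i, y_i in zip(t, y) if t_i)
    penalty = sum(w * w for matrix in weight_matrices
                  for row in matrix for w in row)
    return cross_entropy + lam * penalty


t = [0, 1, 0]                       # one-hot ground truth
y = [0.2, 0.5, 0.3]                 # softmax output (toy values)
weights = [[[1.0, 2.0]], [[3.0]]]   # two toy weight matrices; biases excluded
loss = penalized_cross_entropy(t, y, weights, lam=0.1)
print(loss)
```

Bias vectors are simply left out of `weight_matrices`, mirroring the statement that no penalty is applied to bias parameters.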


We thereafter validate the dropout strategies proposed in Section 3.5. Since network units in different channels do not interact with each other during information propagation, we herein take one channel of LSTM networks to assess the efficacy. Taking the word channel as an example, we first drop out word embeddings. Then with a fixed dropout rate of word embeddings, we test the effect of dropping out LSTM inner cells and the penultimate units, respectively.

We find that dropout of LSTM units hurts the model, even if the dropout rate is small, 0.1, say (Figure 4b). Dropout of embeddings improves model performance by 2.16% (Figure 4a); dropout of the penultimate layer further improves by 0.16% (Figure 4c). This analysis also provides, for other studies, some clues for dropout in LSTM networks.


4.3 Results 

Table 1 compares our SDP-LSTM with other state-of-the-art methods. The first entry in the table presents the highest performance achieved by traditional feature engineering. Hendrickx et al. (2010) leverage a variety of handcrafted features, and use SVM for classification; they achieve an $F_1$-score of 82.2%.

Neural networks are first used in this task in Socher et al. (2012). They build a recursive neural network (RNN) along a constituency tree for relation classification. They extend the basic RNN with matrix-vector interaction and achieve an $F_1$-score of 82.4%.


Zeng et al. (2014) treat a sentence as sequential data and exploit the convolutional neural network (CNN); they also integrate word position information into their model. Santos et al. (2015) design a model called CR-CNN; they propose a ranking-based cost function and elaborately diminish the impact of the Other class, which is not counted in the official $F_1$-measure. In this way, they achieve the state-of-the-art result with the $F_1$-score of 84.1%. Without such special treatment, their $F_1$-score is 82.7%.


Yu et al. (2014) propose a Feature-rich Compositional Embedding Model (FCM) for relation classification, which combines unlexicalized linguistic contexts and word embeddings. They achieve an $F_1$-score of 83.0%.


Our proposed SDP-LSTM model yields an $F_1$-score of 83.7%. It outperforms existing competing approaches, in a fair condition of softmax with cross-entropy error.

It is worth noting that we have also conducted two controlled experiments: (1) a traditional RNN without LSTM units, achieving an $F_1$-score of 82.8%; (2) an LSTM network over the entire dependency path (instead of two sub-paths), achieving an $F_1$-score of 82.2%. These results demonstrate the effectiveness of LSTM and directionality in relation classification.

Classifier   Feature set                                               F1 (%)
----------   --------------------------------------------------------  ------
SVM          POS, WordNet, prefixes and other morphological features,  82.2
             dependency parse, Levin classes, PropBank, FrameNet,
             NomLex-Plus, Google n-gram, paraphrases, TextRunner
RNN          Word embeddings                                           74.8
             Word embeddings, POS, NER, WordNet                        77.6
MVRNN        Word embeddings                                           79.1
             Word embeddings, POS, NER, WordNet                        82.4
CNN          Word embeddings                                           69.7
             Word embeddings, word position embeddings, WordNet        82.7
Chain CNN    Word embeddings, POS, NER, WordNet                        82.7
FCM          Word embeddings                                           80.6
             Word embeddings, dependency parsing, NER                  83.0
CR-CNN       Word embeddings                                           82.8†
             Word embeddings, position embeddings                      82.7
             Word embeddings, position embeddings                      84.1†
SDP-LSTM     Word embeddings                                           82.4
             Word embeddings, POS embeddings, WordNet embeddings,      83.7
             grammar relation embeddings

Table 1: Comparison of relation classification systems. The “†” remark refers to special treatment for the Other class.

4.4 Effect of Different Channels 

This subsection analyzes how different channels affect our model. We first used word embeddings only as a baseline; then we added POS tags, grammatical relations, and WordNet hypernyms, respectively; we also combined all these channels into our models. Note that we did not try the latter three channels alone, because each of them on its own (e.g., POS) does not carry much information.

We see from Table 2 that word embeddings alone in SDP-LSTM yield a remarkable performance of 82.35%, compared with CNNs 69.7%, RNNs 74.8–79.1%, and FCM 80.6%.

Adding either grammatical relations or WordNet hypernyms outperforms other existing methods (data cleaning not considered here). POS tagging is comparatively less informative, but still boosts the $F_1$-score by 0.63%.

We notice that the boosts are not simply added when channels are combined. This suggests that these information sources are complementary to each other in some linguistic aspects. Nonetheless, incorporating all four channels further pushes the $F_1$-score to 83.70%.
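The non-additivity can be checked directly from the figures reported in Table 2. The short calculation below uses only those reported scores; the symbols $\Delta_{\text{POS}}$, $\Delta_{\text{GR}}$, $\Delta_{\text{WN}}$, and $\Delta_{\text{all}}$ are notation introduced here for the gains over the word-only baseline.

```latex
\begin{align*}
\Delta_{\text{POS}} &= 82.98 - 82.35 = 0.63 \\
\Delta_{\text{GR}}  &= 83.21 - 82.35 = 0.86 \\
\Delta_{\text{WN}}  &= 83.03 - 82.35 = 0.68 \\
\Delta_{\text{POS}} + \Delta_{\text{GR}} + \Delta_{\text{WN}} &= 2.17 \\
\Delta_{\text{all}} &= 83.70 - 82.35 = 1.35 < 2.17
\end{align*}
```

The combined gain of 1.35 points is thus smaller than the 2.17 points that would result if the individual gains simply added up.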


Channels                             F1 (%)
-----------------------------------  ------
Word embeddings                      82.35
+ POS embeddings (only)              82.98
+ GR embeddings (only)               83.21
+ WordNet embeddings (only)          83.03
+ POS + GR + WordNet embeddings      83.70

Table 2: Effect of different channels.

5 Conclusion 

In this paper, we propose a novel neural network model, named SDP-LSTM, for relation classification. It learns features for relation classification iteratively along the shortest dependency path. Several types of information (words themselves, POS tags, grammatical relations, and WordNet hypernyms) along the path are used. Meanwhile, we leverage LSTM units for long-range information propagation and integration. We demonstrate the effectiveness of SDP-LSTM by evaluating the model on the SemEval-2010 relation classification task, outperforming existing state-of-the-art methods (in a fair condition without data cleaning). Our result sheds some light on the relation classification task as follows.

• The shortest dependency path can be a valuable resource for relation classification, covering mostly sufficient information of target relations.

• Classifying relation is a challenging task due to the inherent ambiguity of natural languages and the diversity of sentence expression. Thus, integrating heterogeneous linguistic knowledge is beneficial to the task.

• Treating the shortest dependency path as two sub-paths, mapping two different neural networks, helps to capture the directionality of relations.

• LSTM units are effective in feature detection and propagation along the shortest dependency path.